Exploratory analysis of digits dataset

1. Import packages


In [2]:
%matplotlib inline
# imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2. Load and inspect data


In [6]:
train = pd.read_csv('../data/train.csv')

In [7]:
train.head()


Out[7]:
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns


In [63]:
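# .ix was removed from pandas; use positional .iloc instead
x_4 = train.iloc[3].values[1:]  # row 3 is labeled 4; [1:] drops the label
x_1 = train.iloc[2].values[1:]  # row 2 is labeled 1
# repeat the single channel three times so imshow treats it as RGB
four = np.vstack([x_4,x_4,x_4]).T.reshape(28,28,3)
one = np.vstack([x_1,x_1,x_1]).T.reshape(28,28,3)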

In [64]:
show1 = plt.imshow(one)



In [65]:
show4 = plt.imshow(four)


This tells us:

  1. the image data represent the edges of numbers
  2. the edges are not always complete
  3. the intensity values for a pixel range from 0 to 255
  4. there is only one channel to deal with in the image arrays
  5. the images are sized 28 by 28 pixels
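
The observations about intensity range, channel count, and image size can also be checked numerically rather than by eye. Below is a minimal sketch, assuming the same column layout as `train` (a `label` column followed by `pixel0` to `pixel783`). The small random frame `demo` is a hypothetical stand-in for the real file, so the printed range describes the stand-in, not the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: 5 rows, label + 784 pixel columns.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(5, 784))
demo = pd.DataFrame(pixels, columns=[f"pixel{i}" for i in range(784)])
demo.insert(0, "label", [1, 0, 1, 4, 0])

# Drop the label column to get the raw pixel matrix.
X = demo.drop(columns="label").to_numpy()

# Intensity range across all pixels of all rows.
pixel_min, pixel_max = X.min(), X.max()

# 784 values per row is consistent with one channel of 28 x 28 pixels.
side = int(np.sqrt(X.shape[1]))
images = X.reshape(-1, side, side)

print(pixel_min, pixel_max, images.shape)
```

Swapping `demo` for `train` should run the same checks on the real data. Once the rows are reshaped this way, a single image could be shown with `plt.imshow(images[0], cmap='gray')` without stacking three copies of the channel.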

And this raises some questions:

  1. are the digits generally positioned in the center of the image?
    • if not, detecting each digit's 'center of mass' and starting from there might improve the classification
  2. are the images generally positioned upright?
    • row 0's 1 was at quite an angle. could the analysis benefit from some kind of normalization (deskewing) that sets straight lines to vertical?
  3. are numbers in the images generally of the same size?
  4. does the number of nonzero pixels correlate well with the digit, or with a type of digit?
  5. if I run a simple linear regression on the rows here, do any of these questions matter?
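
Questions 1 and 4 can be explored with two small helpers. This is a sketch under assumptions, not a finished analysis: `center_of_mass` and `nonzero_count` are hypothetical helper names, and the test image is a synthetic vertical stroke rather than a row from `train`.

```python
import numpy as np

def center_of_mass(img):
    """Return the intensity-weighted mean (row, col) of a 2-D image."""
    total = img.sum()
    if total == 0:
        # A blank image has no defined center of mass.
        return (np.nan, np.nan)
    rows, cols = np.indices(img.shape)
    return ((rows * img).sum() / total, (cols * img).sum() / total)

def nonzero_count(img):
    """Return the number of pixels with nonzero intensity."""
    return int(np.count_nonzero(img))

# Synthetic 28 x 28 image: a vertical stroke in column 20, rows 4 to 23.
img = np.zeros((28, 28), dtype=np.int64)
img[4:24, 20] = 255

r, c = center_of_mass(img)

# Offset from the geometric center of a 28 x 28 grid, which is (13.5, 13.5).
offset = (r - 13.5, c - 13.5)

print(r, c, offset, nonzero_count(img))
```

Here the stroke is vertically centered but shifted 6.5 pixels to the right, so the offset is what a recentering step would need to undo. For question 4 on the real frame, something like `(train.drop(columns="label") != 0).sum(axis=1).groupby(train["label"]).mean()` should give the average nonzero-pixel count per label.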